The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a historic moment, scientifically and analytically: we have a large amount of data being shared across the globe and analyzed in real time, and the world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by its source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always 'ahead' of Our World In Data. I also noticed that JHU's website was buggy: clicking on the U.S. stats did not reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Fri Jun 12 08:54:24 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the 'analysis' proper but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Fri Jun 12 08:54:24 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name     description
prediction_model  outputs case estimate for given log-linear model parameters slope and intercept
make_long         converts input data to long format (specialized cases)
name_overlaps     outputs the column-name intersection and set differences of two data frames
find_linear_index finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b        # model is fit on log10(cases)
  total_cases<-10^total_cases.log  # back-transform to case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
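
Hypothetical usage of prediction_model (the slope and intercept below are illustrative placeholders, not actual fit results):

```r
# Suppose a log-linear fit gave: log10(cases) = m * Days_since_100 + b
m <- 0.03  # illustrative slope (log10 cases gained per day)
b <- 2.0   # illustrative intercept (log10 cases at day 0, i.e. ~100 cases)
prediction_model(m = m, b = b, days = 0:6)  # predicted cases for the first week
```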
##------------------------------------------

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/value. Also enforces standardization and assumes the data structure naming by using a fixed variable name, value name, and id.vars.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
i<-intersect(names(df1),
names(df2))
sd1<-setdiff(names(df1),
names(df2))
sd2<-setdiff(names(df2),names(df1))
cat("intersection:\n",paste(i,"\n"))
cat("in df1 but not df2:\n",paste(sd1,"\n"))
cat("in df2 but not df1:\n",paste(sd2,"\n"))
return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}
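
A quick sketch of name_overlaps on two toy data frames (hypothetical columns, just to show the three return slots):

```r
df_a <- data.frame(x = 1, y = 2)
df_b <- data.frame(y = 3, z = 4)
ov <- name_overlaps(df_a, df_b)
# ov$int is "y"; ov$sd_1_2 is "x"; ov$sd_2_1 is "z"
```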

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the last `running_avg` second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-running_avg+1),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date `running_avg` days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
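
A synthetic sanity check of find_linear_index (toy numbers, not real case data): a series growing at a constant daily rate has a perfectly straight log2 curve, so its second derivative is ~0 throughout and the function should flag linearity near the start of the series:

```r
# Toy series: constant 10% daily growth, so log2(cases) is a straight line
toy <- data.frame(
  Date = seq(as.Date("2020-03-01"), by = "day", length.out = 20),
  Total_confirmed_cases_perstate = 100 * 1.1^(0:19)
)
find_linear_index(toy, running_avg = 5)  # an early date in the series
```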

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.11.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.11.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.11.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.11.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US death file but not in the other data... Weird. When combining, all datasets will have this column, but US deaths is the only useful one.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # world-level data lacks the City column that US-level data has
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU has a gross file format: it's wide, with each column a date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the World- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # R prefixes numeric column names with "X"; restore the leading 0
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # year ends in .20, not .2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
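
Tracing the cleanup on a single raw column name shows the intent (note `%y` reads only the first two year digits and `as.Date` ignores trailing characters, so the 4-digit year still parses):

```r
d <- "X1.22.20"
d <- gsub(d, pattern = "^X", replacement = "0")     # "01.22.20"
d <- gsub(d, pattern = "20$", replacement = "2020") # "01.22.2020"
as.Date(d, format = "%m.%d.%y")                     # "2020-01-22"
```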

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 142 142
Albania 142 142
Algeria 142 142
Andorra 142 142
Angola 142 142
Antigua and Barbuda 142 142
Argentina 142 142
Armenia 142 142
Australia 1136 1136
Austria 142 142
Azerbaijan 142 142
Bahamas 142 142
Bahrain 142 142
Bangladesh 142 142
Barbados 142 142
Belarus 142 142
Belgium 142 142
Belize 142 142
Benin 142 142
Bhutan 142 142
Bolivia 142 142
Bosnia and Herzegovina 142 142
Botswana 142 142
Brazil 142 142
Brunei 142 142
Bulgaria 142 142
Burkina Faso 142 142
Burma 142 142
Burundi 142 142
Cabo Verde 142 142
Cambodia 142 142
Cameroon 142 142
Canada 1988 1988
Central African Republic 142 142
Chad 142 142
Chile 142 142
China 4686 4686
Colombia 142 142
Comoros 142 142
Congo (Brazzaville) 142 142
Congo (Kinshasa) 142 142
Costa Rica 142 142
Cote d’Ivoire 142 142
Croatia 142 142
Cuba 142 142
Cyprus 142 142
Czechia 142 142
Denmark 426 426
Diamond Princess 142 142
Djibouti 142 142
Dominica 142 142
Dominican Republic 142 142
Ecuador 142 142
Egypt 142 142
El Salvador 142 142
Equatorial Guinea 142 142
Eritrea 142 142
Estonia 142 142
Eswatini 142 142
Ethiopia 142 142
Fiji 142 142
Finland 142 142
France 1562 1562
Gabon 142 142
Gambia 142 142
Georgia 142 142
Germany 142 142
Ghana 142 142
Greece 142 142
Grenada 142 142
Guatemala 142 142
Guinea 142 142
Guinea-Bissau 142 142
Guyana 142 142
Haiti 142 142
Holy See 142 142
Honduras 142 142
Hungary 142 142
Iceland 142 142
India 142 142
Indonesia 142 142
Iran 142 142
Iraq 142 142
Ireland 142 142
Israel 142 142
Italy 142 142
Jamaica 142 142
Japan 142 142
Jordan 142 142
Kazakhstan 142 142
Kenya 142 142
Korea, South 142 142
Kosovo 142 142
Kuwait 142 142
Kyrgyzstan 142 142
Laos 142 142
Latvia 142 142
Lebanon 142 142
Lesotho 142 142
Liberia 142 142
Libya 142 142
Liechtenstein 142 142
Lithuania 142 142
Luxembourg 142 142
Madagascar 142 142
Malawi 142 142
Malaysia 142 142
Maldives 142 142
Mali 142 142
Malta 142 142
Mauritania 142 142
Mauritius 142 142
Mexico 142 142
Moldova 142 142
Monaco 142 142
Mongolia 142 142
Montenegro 142 142
Morocco 142 142
Mozambique 142 142
MS Zaandam 142 142
Namibia 142 142
Nepal 142 142
Netherlands 710 710
New Zealand 142 142
Nicaragua 142 142
Niger 142 142
Nigeria 142 142
North Macedonia 142 142
Norway 142 142
Oman 142 142
Pakistan 142 142
Panama 142 142
Papua New Guinea 142 142
Paraguay 142 142
Peru 142 142
Philippines 142 142
Poland 142 142
Portugal 142 142
Qatar 142 142
Romania 142 142
Russia 142 142
Rwanda 142 142
Saint Kitts and Nevis 142 142
Saint Lucia 142 142
Saint Vincent and the Grenadines 142 142
San Marino 142 142
Sao Tome and Principe 142 142
Saudi Arabia 142 142
Senegal 142 142
Serbia 142 142
Seychelles 142 142
Sierra Leone 142 142
Singapore 142 142
Slovakia 142 142
Slovenia 142 142
Somalia 142 142
South Africa 142 142
South Sudan 142 142
Spain 142 142
Sri Lanka 142 142
Sudan 142 142
Suriname 142 142
Sweden 142 142
Switzerland 142 142
Syria 142 142
Taiwan* 142 142
Tajikistan 142 142
Tanzania 142 142
Thailand 142 142
Timor-Leste 142 142
Togo 142 142
Trinidad and Tobago 142 142
Tunisia 142 142
Turkey 142 142
Uganda 142 142
Ukraine 142 142
United Arab Emirates 142 142
United Kingdom 1562 1562
Uruguay 142 142
US 142 142
US_state 463062 463062
Uzbekistan 142 142
Venezuela 142 142
Vietnam 142 142
West Bank and Gaza 142 142
Western Sahara 142 142
Yemen 142 142
Zambia 142 142
Zimbabwe 142 142
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total
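
A minimal illustration of the spread-then-divide step above, on toy numbers for a hypothetical region:

```r
toy <- data.frame(region = c("A", "A"),
                  case_type = c("total", "death"),
                  cases = c(1000, 30))
toy_wide <- spread(toy, key = case_type, value = cases)
toy_wide$mortality_rate <- toy_wide$death / toy_wide$total
toy_wide
#   region death total mortality_rate
# 1      A    30  1000           0.03
```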

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
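
The merge/ddply one-liner above is easier to see on a toy series (synthetic counts, hypothetical region):

```r
toy <- data.frame(Country.Region = "X",
                  Date.numeric = 1:5,
                  Total_confirmed_cases = c(40, 90, 150, 300, 700))
# First date each region crossed 100 cases
first100 <- ddply(subset(toy, Total_confirmed_cases > 100),
                  "Country.Region", summarise, case100_date = min(Date.numeric))
toy <- merge(toy, first100)
toy$Days_since_100 <- toy$Date.numeric - toy$case100_date
# Days_since_100 is -2, -1, 0, 1, 2
```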

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 5349
Alaska 1054
Arizona 1316
Arkansas 5655
California 5016
Colorado 4847
Connecticut 781
Delaware 327
Diamond Princess 87
District of Columbia 88
Florida 5654
Georgia 12747
Grand Princess 88
Guam 88
Hawaii 452
Idaho 2584
Illinois 7454
Indiana 7389
Iowa 6980
Kansas 5876
Kentucky 8386
Louisiana 5320
Maine 1355
Maryland 2057
Massachusetts 1309
Michigan 6506
Minnesota 6218
Mississippi 6601
Missouri 7471
Montana 2310
Nebraska 4495
Nevada 1041
New Hampshire 928
New Jersey 1971
New Mexico 2292
New York 5019
North Carolina 7836
North Dakota 2789
Northern Mariana Islands 73
Ohio 6973
Oklahoma 5378
Oregon 2685
Pennsylvania 5472
Puerto Rico 88
Rhode Island 520
South Carolina 3835
South Dakota 3535
Tennessee 7562
Texas 16280
Utah 1201
Vermont 1245
Virgin Islands 88
Virginia 9932
Washington 3424
West Virginia 3702
Wisconsin 5354
Wyoming 1665
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question dataset x y color facet pch dimensions
comparative_longitudinal_case_trend long time log_cases geography none (case type?) case_type [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend long time cases geography case_type ? [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend wide time mortality rate geography none none [15, 50, 4] geography
death vs total correlation wide cases deaths geography none none [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical region, logcase vs day (100th case)
    # output: m, b for each geographical region ID



# total/death on the same plot differ by ~2 logs, so when plotting log scale, use pch; when plotting absolute counts, need free scales
# when plotting death and case on the same plot, melt.

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 56315
China 142
Diamond Princess 123
Korea, South 113
Japan 112
Italy 110
Iran 107
Singapore 104
France 103
Germany 103
Spain 102
US 101
Switzerland 99
United Kingdom 99
Belgium 98
Netherlands 98
Norway 98
Sweden 98
Austria 96
Malaysia 95
Australia 94
Bahrain 94
Denmark 94
Canada 93
Qatar 93
Iceland 92
Brazil 91
Czechia 91
Finland 91
Greece 91
Iraq 91
Israel 91
Portugal 91
Slovenia 91
Egypt 90
Estonia 90
India 90
Ireland 90
Kuwait 90
Philippines 90
Poland 90
Romania 90
Saudi Arabia 90
Indonesia 89
Lebanon 89
Pakistan 89
San Marino 89
Thailand 89
Chile 88
Luxembourg 87
Peru 87
Russia 87
Ecuador 86
Mexico 86
Slovakia 86
South Africa 86
United Arab Emirates 86
Armenia 85
Colombia 85
Croatia 85
Panama 85
Serbia 85
Taiwan* 85
Turkey 85
Argentina 84
Bulgaria 84
Latvia 84
Uruguay 84
Algeria 83
Costa Rica 83
Dominican Republic 83
Hungary 83
Andorra 82
Bosnia and Herzegovina 82
Jordan 82
Lithuania 82
Morocco 82
New Zealand 82
North Macedonia 82
Vietnam 82
Albania 81
Cyprus 81
Malta 81
Moldova 81
Brunei 80
Burkina Faso 80
Sri Lanka 80
Tunisia 80
Ukraine 79
Azerbaijan 78
Ghana 78
Kazakhstan 78
Oman 78
Senegal 78
Venezuela 78
Afghanistan 77
Cote d’Ivoire 77
Cuba 76
Mauritius 76
Uzbekistan 76
Cambodia 75
Cameroon 75
Honduras 75
Nigeria 75
West Bank and Gaza 75
Belarus 74
Georgia 74
Bolivia 73
Kosovo 73
Kyrgyzstan 73
Montenegro 73
Congo (Kinshasa) 72
Kenya 71
Niger 70
Guinea 69
Rwanda 69
Trinidad and Tobago 69
Paraguay 68
Bangladesh 67
Djibouti 65
El Salvador 64
Guatemala 63
Madagascar 62
Mali 61
Congo (Brazzaville) 58
Jamaica 58
Gabon 56
Somalia 56
Tanzania 56
Ethiopia 55
Burma 54
Sudan 53
Liberia 52
Maldives 50
Equatorial Guinea 49
Cabo Verde 47
Sierra Leone 45
Guinea-Bissau 44
Togo 44
Zambia 43
Eswatini 42
Chad 41
Tajikistan 40
Haiti 38
Sao Tome and Principe 38
Benin 36
Nepal 36
Uganda 36
Central African Republic 35
South Sudan 35
Guyana 33
Mozambique 32
Yemen 28
Mongolia 27
Mauritania 24
Nicaragua 24
Malawi 18
Syria 18
Zimbabwe 16
Bahamas 15
Libya 15
Comoros 13
Suriname 5
Angola 2
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this - China has different provinces. Plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 142
Korea, South 113
Japan 112
Italy 110
Iran 107
Singapore 104
France 103
Germany 103
Spain 102
US 101
Switzerland 99
United Kingdom 99
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-06-05        18418
## 2    Afghanistan           <NA> <NA> 2020-04-08        18360
## 3    Afghanistan           <NA> <NA> 2020-04-12        18364
## 4    Afghanistan           <NA> <NA> 2020-04-07        18359
## 5    Afghanistan           <NA> <NA> 2020-06-04        18417
## 6    Afghanistan           <NA> <NA> 2020-05-21        18403
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                    309                 18969     0.01628974
## 2                     14                   444     0.03153153
## 3                     18                   607     0.02965404
## 4                     14                   423     0.03309693
## 5                    300                 18054     0.01661682
## 6                    193                  8676     0.02224527
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  4.278044                   2.489958        18348
## 2                  2.647383                   1.146128        18348
## 3                  2.783189                   1.255273        18348
## 4                  2.626340                   1.146128        18348
## 5                  4.256573                   2.477121        18348
## 6                  3.938320                   2.285557        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1             70  NA   NA         NA                           NA
## 2             12  NA   NA         NA                           NA
## 3             16  NA   NA         NA                           NA
## 4             11  NA   NA         NA                           NA
## 5             69  NA   NA         NA                           NA
## 6             55  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:
1. Find the time at which cases vs. date becomes linear in each plot
2. Fit a linear model for each city

# What is the predicted # of cases for the next few days?
# How has the model performed historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually fix states so that Maryland is switched out for New York
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }
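The actual `find_linear_index` is defined upstream in this markdown. For readers of this section alone, a minimal sketch of the second-derivative idea described above (function name and threshold here are illustrative, not the real implementation):

```r
# Hypothetical sketch: growth is considered "linear" once the discrete
# second derivative of cases vs. date is near zero relative to the case scale
find_linear_index_sketch <- function(df) {
  df <- df[order(df$Date), ]
  cases <- df$Total_confirmed_cases_perstate
  d2 <- diff(diff(cases))                       # discrete 2nd derivative
  idx <- which(abs(d2) < 0.05 * max(cases))[1]  # first near-zero curvature
  if (is.na(idx)) idx <- 1
  as.numeric(df$Date[idx + 2])                  # +2 offsets the two diff()s
}
```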

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))
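The offset arithmetic above works because R stores `Date` values as integer days since 1970-01-01, so the numeric form round-trips through `as.Date()` with that origin:

```r
# Dates are days since 1970-01-01, so offsets can be added numerically
d <- as.numeric(as.Date("2020-04-20")) + c(0, 1, 2, 3, 7)
as.Date(d, origin = "1970-01-01")  # 2020-04-20 through 2020-04-27
```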

predicted_days_df<-data.frame(Province.State=character(0),days=numeric(0),Total_confirmed_cases_perstate=numeric(0))

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 cases based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long will it be until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable, perhaps with a delay?
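The "how long till x cases" question has a closed form under the log10-linear model used in this section. A sketch (the slope 0.1 log10-units/day is made up for illustration, not a fitted value):

```r
# log10(cases) = m*Days_since_100 + b  =>  solve for the day index at N cases
days_to_reach <- function(N, m, b) (log10(N) - b) / m
doubling_time <- function(m) log10(2) / m  # days for cases to double

days_to_reach(1000, m = 0.1, b = 2)  # day 10 under this made-up fit
doubling_time(0.1)                   # roughly 3 days per doubling
```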

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)

# Sanity-check data: state A follows y = x + 5, state B follows y = 5x + 1
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data = tmp,formula = y~x) # pooled fit, for comparison with the per-state fits

# Fit one linear model per group and return the coefficients as a data frame
train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # the intercept column becomes "b"; the slope column keeps the predictor's name
  case_models.parameters <- plyr::rename(case_models.parameters, c("(Intercept)"="b"))
  return(case_models.parameters)
}

# Should recover m=1, b=5 for state A and m=5, b=1 for state B
train_lm(tmp,"state",y~x)
 
# Backtest: for each of the previous 10 days, fit the model on data up to that
# day and compare its 1-3 day-ahead predictions against what actually happened
historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
for(i in c(1,2,3,4,5,6,7,8,9,10)){
day_x<-today_num-i
day_x_nextweek<-day_x+c(1,2,3)
model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
prediction_day_x_nextweek$type<-"Predicted"
actual_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
actual_day_x_nextweek$type<-"Historical"
historical_model_predictions.i<-data.frame(day_x=day_x,rbind(actual_day_x_nextweek,prediction_day_x_nextweek))
historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
}

historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)
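`prediction_model` is defined upstream in the document. For readers of this chunk alone, a minimal stand-in consistent with how it is called here (an assumed sketch, not the author's definition) would invert the log10 fit:

```r
# Given slope m and intercept b of the log10-linear fit, predict raw and
# log10 case counts for the requested day indices
prediction_model_sketch <- function(m, b, days) {
  cases.log <- m * days + b
  data.frame(Days_since_100 = days,
             Total_confirmed_cases = 10^cases.log,
             Total_confirmed_cases.log = cases.log)
}
```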

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes % change in user mobility relative to baseline for:
* Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people(Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Grocery_Pharmacy 0.9999788 -34.0
Hawaii Retail_Recreation 0.9922959 -56.0
New Hampshire Parks 0.9530303 -20.0
Vermont Parks 0.9336770 -35.5
Maine Transit -0.9153180 -50.0
Utah Residential -0.8888658 12.0
Connecticut Grocery_Pharmacy -0.8861382 -6.0
Hawaii Parks 0.8363104 -72.0
Utah Transit -0.8291928 -18.0
South Dakota Parks 0.8212771 -26.0
Hawaii Transit 0.7884139 -89.0
Wyoming Parks -0.7849623 -4.0
Alaska Workplace -0.7783669 -33.0
Arizona Grocery_Pharmacy -0.7751668 -15.0
Rhode Island Workplace -0.7648509 -39.5
Massachusetts Workplace -0.7523893 -39.0
Connecticut Transit -0.7477857 -50.0
Vermont Grocery_Pharmacy -0.6576427 -25.0
New York Workplace -0.6515408 -34.5
Utah Parks -0.6344390 17.0
Arizona Residential 0.6324102 13.0
Rhode Island Retail_Recreation -0.6264114 -45.0
Alaska Grocery_Pharmacy -0.6220264 -7.0
Alaska Residential 0.6191911 13.0
North Dakota Parks 0.6172542 -34.0
Hawaii Residential -0.6151452 19.0
New Jersey Parks -0.6035832 -6.0
Rhode Island Residential -0.6030669 18.5
Nevada Transit -0.5989959 -20.0
Maine Workplace -0.5974248 -30.0
New York Retail_Recreation -0.5890553 -46.0
Utah Workplace -0.5832217 -37.0
Nebraska Workplace 0.5828620 -32.0
Arizona Retail_Recreation -0.5721797 -42.5
New Jersey Workplace -0.5429925 -44.0
Wyoming Transit -0.5376244 -17.0
Idaho Residential -0.5363728 11.0
New York Parks 0.5295690 20.0
Connecticut Retail_Recreation -0.5138229 -45.0
Connecticut Residential 0.5112331 14.0
Kansas Parks 0.4986140 72.0
Massachusetts Retail_Recreation -0.4972982 -44.0
Maine Parks 0.4894549 -31.0
New Jersey Grocery_Pharmacy -0.4834600 2.5
New Mexico Grocery_Pharmacy -0.4810636 -11.0
Montana Parks -0.4796329 -58.0
Connecticut Workplace -0.4742959 -39.0
Montana Residential 0.4717634 14.0
Iowa Parks -0.4717463 28.5
West Virginia Parks 0.4701947 -33.0
Nebraska Residential -0.4666259 14.0
Rhode Island Parks 0.4526207 52.0
New Mexico Residential 0.4512141 13.5
Maryland Workplace -0.4491352 -35.0
North Dakota Retail_Recreation -0.4471316 -42.0
New Jersey Retail_Recreation -0.4368898 -62.5
Illinois Transit -0.4310289 -31.0
Arizona Transit 0.4308371 -38.0
Pennsylvania Workplace -0.4237247 -36.0
South Carolina Workplace 0.4157476 -30.0
New Mexico Parks 0.4146560 -31.5
Maryland Grocery_Pharmacy -0.4091531 -10.0
Massachusetts Grocery_Pharmacy -0.4091048 -7.0
New Jersey Transit -0.4075140 -50.5
New Hampshire Residential -0.4017444 14.0
Vermont Residential 0.3940837 11.5
Michigan Parks 0.3893993 28.5
Pennsylvania Retail_Recreation -0.3865474 -45.0
Alabama Grocery_Pharmacy -0.3845963 -2.0
Arkansas Parks 0.3827358 -12.0
Alabama Workplace -0.3811043 -29.0
Oregon Workplace -0.3776803 -31.0
Kentucky Parks -0.3750956 28.5
Wyoming Workplace -0.3747455 -31.0
New York Transit -0.3688685 -48.0
New Mexico Retail_Recreation -0.3520013 -42.5
Idaho Grocery_Pharmacy -0.3486839 -5.5
Wisconsin Transit -0.3481425 -23.5
Idaho Workplace -0.3466799 -29.0
Oregon Parks -0.3462938 16.5
Nebraska Grocery_Pharmacy 0.3415999 -0.5
California Transit -0.3307846 -42.0
Wyoming Grocery_Pharmacy -0.3242293 -10.0
Hawaii Workplace 0.3231885 -46.0
Maryland Retail_Recreation -0.3222845 -39.0
Arkansas Retail_Recreation -0.3216719 -30.0
North Dakota Workplace 0.3162939 -40.0
Virginia Transit -0.3157728 -33.0
California Residential 0.3153552 14.0
Idaho Transit -0.3101230 -30.0
Maine Retail_Recreation -0.3027213 -42.0
California Parks -0.2997769 -38.5
Minnesota Transit -0.2942195 -28.5
Illinois Workplace -0.2895620 -30.5
Colorado Residential 0.2861959 14.0
Florida Residential 0.2853415 14.0
Pennsylvania Parks 0.2823193 12.0
Missouri Residential -0.2816092 13.0
Montana Transit 0.2810459 -41.0
Mississippi Residential 0.2810255 13.0
Nevada Residential 0.2805279 17.0
North Carolina Grocery_Pharmacy 0.2803763 0.0
Vermont Retail_Recreation 0.2802302 -57.0
North Carolina Workplace 0.2695368 -31.0
Georgia Grocery_Pharmacy -0.2658482 -10.0
South Carolina Parks -0.2652238 -23.0
Alabama Transit -0.2648984 -36.5
Rhode Island Grocery_Pharmacy 0.2634588 -7.5
Texas Workplace 0.2593114 -32.0
Vermont Workplace -0.2588909 -43.0
Maryland Residential 0.2582406 15.0
Pennsylvania Grocery_Pharmacy -0.2575567 -6.0
Kansas Workplace 0.2569225 -32.5
Texas Residential -0.2543552 15.0
Rhode Island Transit -0.2528446 -56.0
Illinois Parks 0.2512331 26.5
Tennessee Workplace -0.2492550 -31.0
Florida Parks -0.2432877 -43.0
Alaska Retail_Recreation 0.2395662 -39.0
Tennessee Residential 0.2364085 11.5
New York Grocery_Pharmacy -0.2301335 8.0
West Virginia Grocery_Pharmacy -0.2251599 -6.0
Washington Grocery_Pharmacy 0.2243587 -7.0
Wisconsin Parks 0.2232476 51.5
Georgia Workplace -0.2231884 -33.5
Nevada Retail_Recreation -0.2206094 -43.0
South Dakota Workplace 0.2132724 -35.0
North Dakota Grocery_Pharmacy -0.2114182 -8.0
Minnesota Parks 0.2087876 -9.0
Georgia Retail_Recreation -0.2071400 -41.0
Montana Workplace -0.2062976 -40.0
Idaho Retail_Recreation -0.1994250 -39.5
West Virginia Workplace 0.1991519 -33.0
North Carolina Transit 0.1985732 -32.0
Utah Retail_Recreation -0.1961675 -40.0
Mississippi Grocery_Pharmacy -0.1952519 -8.0
Kansas Grocery_Pharmacy -0.1951776 -14.0
Illinois Residential 0.1941120 14.0
Oregon Transit 0.1937067 -27.5
Arizona Parks -0.1908245 -44.5
Colorado Parks -0.1903452 2.0
North Carolina Residential 0.1889768 13.0
Arkansas Residential 0.1882963 12.0
Virginia Residential 0.1859312 14.0
Texas Parks 0.1821099 -42.0
Connecticut Parks 0.1815763 43.0
Wisconsin Residential -0.1776527 14.0
Nebraska Transit -0.1735733 -9.0
Missouri Workplace 0.1730787 -28.0
Kentucky Transit -0.1713968 -31.0
Kentucky Grocery_Pharmacy 0.1706162 4.0
Indiana Parks -0.1700524 29.0
Pennsylvania Transit -0.1692302 -42.0
Oregon Residential 0.1659095 10.5
Nevada Workplace -0.1642381 -40.0
New Hampshire Retail_Recreation -0.1628477 -41.0
New Jersey Residential 0.1628383 18.0
Indiana Residential 0.1623577 12.0
Massachusetts Parks 0.1618841 39.0
New Mexico Transit 0.1606930 -38.5
Ohio Transit 0.1565773 -28.0
Iowa Transit 0.1524402 -24.0
South Dakota Residential 0.1485329 15.0
Michigan Workplace -0.1478876 -40.0
Indiana Retail_Recreation 0.1445007 -38.0
Virginia Grocery_Pharmacy -0.1418520 -8.0
California Grocery_Pharmacy -0.1397192 -11.5
Alabama Parks 0.1390553 -1.0
Mississippi Retail_Recreation -0.1327979 -40.0
Minnesota Retail_Recreation 0.1302458 -40.0
Tennessee Parks -0.1288353 10.5
Wisconsin Workplace -0.1285068 -31.0
North Dakota Residential -0.1269050 17.0
North Dakota Transit 0.1254749 -48.0
Florida Retail_Recreation 0.1250284 -43.0
California Retail_Recreation -0.1245487 -44.0
Missouri Transit -0.1228531 -24.5
Texas Grocery_Pharmacy 0.1221446 -14.0
Mississippi Transit -0.1180793 -38.5
Wyoming Residential 0.1180478 12.5
Wisconsin Grocery_Pharmacy 0.1174008 -1.0
Missouri Parks 0.1137470 0.0
Georgia Residential -0.1114710 13.0
Kentucky Residential 0.1110315 12.0
North Carolina Retail_Recreation 0.1093711 -34.0
Nebraska Retail_Recreation 0.1077230 -36.0
South Dakota Transit -0.1073351 -40.0
Nevada Parks 0.1059216 -12.5
Massachusetts Transit -0.1056549 -45.0
Virginia Parks 0.1055830 6.0
Idaho Parks 0.1047149 -22.0
Arkansas Workplace -0.1046286 -26.0
New Hampshire Grocery_Pharmacy -0.1025858 -6.0
Oregon Grocery_Pharmacy 0.1021927 -7.0
Massachusetts Residential 0.1017675 15.0
Kansas Transit -0.1016290 -26.5
Oklahoma Grocery_Pharmacy -0.1000828 -0.5
Michigan Retail_Recreation -0.0999028 -53.0
Maryland Transit -0.0992443 -39.0
Alabama Retail_Recreation 0.0984950 -39.0
Michigan Transit 0.0966554 -46.0
Ohio Residential 0.0959556 14.0
Iowa Workplace 0.0952537 -30.0
Oklahoma Parks 0.0944498 -18.5
Washington Workplace -0.0935464 -38.0
Michigan Residential 0.0935114 15.0
Texas Transit 0.0924182 -41.0
New York Residential 0.0912760 17.5
Indiana Workplace 0.0882015 -34.0
Mississippi Parks -0.0876805 -25.0
Oklahoma Workplace 0.0870034 -31.0
Ohio Parks -0.0869618 67.5
Virginia Workplace -0.0867909 -31.5
Georgia Parks 0.0845830 -6.0
Utah Grocery_Pharmacy 0.0844606 -4.0
North Carolina Parks -0.0807240 7.0
Minnesota Workplace -0.0802130 -33.0
Oregon Retail_Recreation -0.0798551 -40.5
Iowa Grocery_Pharmacy -0.0791425 4.0
Pennsylvania Residential 0.0783106 15.0
Minnesota Residential -0.0763405 17.0
Virginia Retail_Recreation -0.0752809 -35.0
Washington Residential 0.0747169 13.0
Minnesota Grocery_Pharmacy 0.0740046 -6.0
Ohio Grocery_Pharmacy 0.0733723 0.0
Kentucky Retail_Recreation 0.0723322 -29.0
Maine Residential -0.0683959 11.0
Florida Grocery_Pharmacy 0.0670691 -14.0
California Workplace -0.0660173 -36.0
Alabama Residential -0.0650277 11.0
Colorado Transit 0.0647247 -36.0
West Virginia Transit -0.0630719 -45.0
Texas Retail_Recreation 0.0600219 -40.0
Washington Transit -0.0561229 -33.5
West Virginia Residential -0.0547248 11.0
Vermont Transit -0.0540271 -63.0
South Carolina Transit 0.0517456 -45.0
Washington Parks 0.0514921 -3.5
Ohio Retail_Recreation 0.0513874 -36.0
Indiana Grocery_Pharmacy -0.0512868 -5.5
New Hampshire Workplace 0.0480237 -37.0
Kentucky Workplace -0.0456446 -36.5
Oklahoma Residential 0.0448371 15.0
Indiana Transit 0.0423123 -29.0
Illinois Grocery_Pharmacy -0.0408169 2.0
Mississippi Workplace -0.0402548 -33.0
Ohio Workplace -0.0361783 -35.0
South Carolina Grocery_Pharmacy 0.0345764 1.0
South Dakota Retail_Recreation -0.0328386 -39.0
Wisconsin Retail_Recreation 0.0310009 -44.0
Maine Grocery_Pharmacy -0.0306653 -13.0
Wyoming Retail_Recreation -0.0287217 -39.0
Georgia Transit -0.0285867 -35.0
Colorado Grocery_Pharmacy -0.0281447 -17.0
Michigan Grocery_Pharmacy -0.0280182 -11.0
Iowa Retail_Recreation 0.0274725 -38.0
New Mexico Workplace 0.0271814 -34.0
Florida Transit -0.0270896 -49.0
Missouri Retail_Recreation -0.0261404 -36.5
South Carolina Residential -0.0259587 12.0
Illinois Retail_Recreation 0.0243136 -40.0
Montana Grocery_Pharmacy -0.0229603 -16.0
West Virginia Retail_Recreation -0.0229557 -38.5
Arkansas Transit 0.0216486 -27.0
Tennessee Grocery_Pharmacy 0.0212917 6.0
Arizona Workplace -0.0203444 -35.0
Iowa Residential -0.0202131 13.0
Kansas Residential -0.0195269 13.0
Florida Workplace -0.0194170 -33.0
Colorado Retail_Recreation -0.0193849 -44.0
Colorado Workplace 0.0186891 -39.0
Arkansas Grocery_Pharmacy -0.0175011 3.0
Tennessee Transit 0.0169037 -32.0
Washington Retail_Recreation 0.0163743 -42.0
Oklahoma Transit 0.0158671 -26.0
Montana Retail_Recreation 0.0154367 -51.0
Maryland Parks 0.0130171 27.0
Tennessee Retail_Recreation -0.0121802 -30.0
Oklahoma Retail_Recreation 0.0104534 -31.0
Nebraska Parks 0.0103334 55.5
Nevada Grocery_Pharmacy 0.0092523 -12.5
Kansas Retail_Recreation -0.0080117 -38.0
New Hampshire Transit 0.0057686 -57.0
South Dakota Grocery_Pharmacy -0.0045838 -9.0
Missouri Grocery_Pharmacy 0.0021279 2.0
South Carolina Retail_Recreation 0.0008872 -35.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to github. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | handle | link |
| --- | --- | --- |
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |
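In an R Markdown report the manifest renders more cleanly as a table via knitr::kable(deliverable_manifest); the pipe-delimited rows can also be built directly in base R. A minimal sketch with toy rows (the two entries below are illustrative, not the full manifest):

```r
# toy manifest with the same shape as deliverable_manifest
manifest <- data.frame(
  name   = c("World mortality", "City mortality"),
  handle = c("Corona_Cases.world.mortality.plot", "Corona_Cases.city.mortality.plot"),
  stringsAsFactors = FALSE
)

# collapse each row into a "name | handle" line, markdown-table style
rows <- apply(manifest, 1, paste, collapse = " | ")
cat(rows, sep = "\n")
```

Note that apply() over rows (MARGIN = 1) joins the fields within each row, whereas MARGIN = 2 would collapse whole columns instead.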

Tables

CONCLUSION

Overall, the trend in COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, remain in the log-linear phase). The mortality rate (deaths / confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of case load (U.S. data).

See the table below for a detailed breakdown.

| Question | Answer |
| --- | --- |
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the better correlations, but the results are mixed. |
| What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with a trailing-off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall. |
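The "log-linear phase" claim can be checked directly: regress log(cases) on day and inspect the slope and fit. A minimal base-R sketch on synthetic data (the 0.15 daily growth rate below is made up for illustration, not an estimate from the real case data):

```r
# synthetic exponential caseload: cases grow at a constant daily rate
day   <- 1:30
cases <- 50 * exp(0.15 * day)

# log-linear model: if growth is exponential, log(cases) is linear in day
fit <- lm(log(cases) ~ day)

# the slope recovers the daily growth rate, and an R^2 near 1 indicates the
# region is still in the log-linear phase; a flattening R^2 over a moving
# window signals the trailing-off seen for NY, NJ, and PA above
coef(fit)[["day"]]
summary(fit)$r.squared
```

Fitting this model over a sliding date window, per region, is one way to date when each state left the log-linear phase.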

END

End: ##—— Fri Jun 12 08:55:55 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")   # one-time install
library(maps)
library(dplyr)
library(ggplot2)
# object name follows the ggplot2 book's Michigan example; the data pulled
# here are actually Pennsylvania counties
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
# placeholder case counts: one value per polygon-vertex row in mi_counties
mi_counties$cases <- seq_len(nrow(mi_counties))
# name_overlaps(): project helper for checking key overlap before merging
name_overlaps(metadata, Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map, c("region", "group")))

# variable names kept from the tutorial this chunk was adapted from;
# only the U.S. is selected here
some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinates for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date == "2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
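The choropleth attempts above stall because Corona_Cases.US_state carries a single Lat/Long point per state, while geom_polygon needs full boundary vertices. The usual fix is to merge case counts onto map_data() polygons by region name, so every vertex carries a fill value. A minimal base-R sketch with toy coordinates and counts (all values illustrative):

```r
# map_data()-style polygon rows: one row per boundary vertex
polys <- data.frame(
  long   = c(-77.1, -76.9, -77.0, -75.2, -75.0, -75.1),
  lat    = c( 39.5,  39.7,  39.6,  39.9,  40.1,  40.0),
  region = c("maryland", "maryland", "maryland",
             "pennsylvania", "pennsylvania", "pennsylvania"),
  order  = 1:6,
  stringsAsFactors = FALSE
)

# one case count per region (names lower-cased to match map_data())
cases <- data.frame(region = c("maryland", "pennsylvania"),
                    Total_confirmed_cases = c(1000, 2500),
                    stringsAsFactors = FALSE)

# join, then restore vertex order so polygons draw correctly
filled <- merge(polys, cases, by = "region")
filled <- filled[order(filled$order), ]
head(filled)
```

`filled` can then be handed to ggplot2 as `geom_polygon(aes(long, lat, group = region, fill = Total_confirmed_cases))`.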
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

# there is no `join` package; use dplyr for an attribute join
# (or sf::st_join for a spatial join)
dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


# coords must be (x, y) = (longitude, latitude), in that order
spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Long","Lat")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
# st_cast() expects an sf object; convert the sp object first
st_as_sf(spdf)

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net